Web pages, text types, and linguistic features: Some issues

Author

  • Marina Santini
Abstract

1 Introduction

With the growth of the Web, a massive quantity of documents, namely web pages, has become freely available for (corpus-)linguistic studies. Web pages can be considered a new kind of document, much more unpredictable and individualized than paper documents. While the linear organization of most paper documents is still reflected in traditional electronic corpora, such as the British National Corpus (BNC), web pages have a visual organization that allows a single document to include several functions or several texts with different communicative purposes. For example, the space on a web page can be divided into different sections, organized by lists of links – mainly isolated noun structures or verbal elements (Haas and Grams 2000: 186–187) – and by snippets of text scattered around the main body of the document, such as navigational buttons, menus, ads, and search boxes, which are visually dislocated in different areas of a single page. Additionally, hyperlinking (Haas and Grams 1998; Crowston and Williams 1999), interactivity, and multi-functionality (Shepherd and Watters 1999) can affect the textuality of web pages, which also rely heavily on images and other graphical elements. The use of fonts of different types, sizes, and colours, as well as of formatting devices such as columns, lines separating different sections of a document, and pictures, is not new (cf. Waller 1987 for a detailed description of the role of both language and typography in the formation of document types). However, a newspaper article organized in columns and headlines does not lose its specific linguistic and textual characteristics when it is included in a corpus like the BNC. The same is not true for many web pages, because the visual structure of a web page incorporating a newspaper article in most cases cannot be flattened out or ignored without losing important information (cf. Ihlström and Lundberg 2003; Ihlström and Åkesson 2004).
A web page can be considered a sort of container from which the reader picks up the information s/he needs. Artificially separating what is considered to be the main body from the rest is an arbitrary operation and it would




Publication date: 2006